
Benchmarks for some common workloads #243

Merged
merged 22 commits on Aug 17, 2022

Conversation

gjoseph92
Contributor

Closes #191

This adds all benchmarks from dask/distributed#6560 (comment).

They have been slightly rewritten, so the tests pick their data size automatically as a factor of total cluster memory. This is intended to make #241 significantly easier: by just parametrizing over different clusters, we can test both strong and weak scaling of the same tests.

This also adds a couple of utility functions, scaled_array_shape and timeseries_of_size, to help writing dynamically-scalable tests like this. I found that utilities like this make translating real-world examples into benchmark tests significantly more pleasant: you just look at real_world_data_size / real_world_cluster_size, multiply that factor by cluster_memory(client), and pass the target data size into one of the helper functions.
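For illustration, a scaled benchmark might look roughly like the sketch below. The helper names are the ones added in this PR, but their exact signatures, the import path, the `small_client` fixture, the test name, and the 2x memory factor are all assumptions for the example, not the PR's actual code:

```python
import dask.array as da

# Assumed import path for the helpers this PR adds (they were reviewed under tests/utils_test.py).
from utils_test import cluster_memory, scaled_array_shape


def test_scaled_array_workload(small_client):  # illustrative test name and fixture
    # Suppose the real-world workload used ~2x its cluster's memory;
    # keep that data:memory ratio whatever size the test cluster is.
    target_nbytes = cluster_memory(small_client) * 2

    # Hypothetical call: hold the second dimension fixed and solve for the
    # first so the array totals roughly target_nbytes.
    shape = scaled_array_shape(target_nbytes, ("x", 1_000_000))

    arr = da.random.random(shape, chunks="auto")
    arr.mean(axis=-1).compute()
```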

Contributor

@ian-r-rose ian-r-rose left a comment


You previously expressed interest in restructuring some of the test layout -- do we still want to rip that bandaid off?

Resolved review threads: tests/workloads/test_dataframe.py (two, outdated), tests/utils_test.py, and tests/workloads/test_array.py (outdated).
@gjoseph92
Contributor Author

You previously expressed interest in restructuring some of the test layout -- do we still want to rip that bandaid off?

We probably do, but let's do it in a separate PR. I've managed to organize these in a way I'm satisfied with (they're separate from all the other benchmarks right now), but it might be good to eventually mix some of the other ones into these?

@ian-r-rose do you think this is ready to merge? The failed tests look like running out of AWS vCPU quota and things like that, not actual failures (so it's hard to know).

@ian-r-rose
Contributor

@ian-r-rose do you think this is ready to merge? The failed tests look like running out of AWS vCPU quota and things like that, not actual failures (so it's hard to know).

Yes, I think this is ready, except these tests are not being run right now (unless I'm missing something?). The test workflow currently runs the different directories separately in their own jobs, so nobody is running the workloads directory. I'd like to discuss changing that (probably in the same refactor as we were discussing above), but for now we should probably just give workloads their own job.

Seems easier than setting up yet another github action to run `workloads`
Another thing you didn't have to worry about with package_sync!
@gjoseph92
Contributor Author

Okay, I've gotten everything working except for tests/benchmarks/test_array.py::test_climatic_mean.

When I run this test on its own against coiled-runtime 0.0.4, it does fine. The dashboard looks as it should, tasks go quickly. There's a little bit of spillage but not much.

However, it's always failing in the full CI job. With a lot of pain and watching GitHub CI logs and clicking at random at clusters on the coiled dashboard, I managed to find the cluster that was running the test and watch it. The dashboard looked a bit worse, more data spilled to disk. Workers kept running out of memory and restarting. So progress was extremely slow, and kept rolling back every time a worker died.

(Screenshot: cluster dashboard during the CI run, 2022-08-16 at 7:53 PM)

I think this demonstrates a real failure case. This workload should work in the allotted time, but it doesn't in a real-world case (when other things have run before it).

What do we want to do here? Merge it skipped? How do we want to handle integration tests for things that are currently broken? It's important not to forget about them, and also valuable to see when exactly they get fixed. If we skip it, I can try to remember to manually test it and un-skip it when something gets merged upstream that resolves it, but that's easy to mess up, and seeing the clear history gives us both traceability and a good story.


Theories for why it's failing:

  • On distributed==2022.6.0, MALLOC_TRIM_THRESHOLD_ isn't yet set by default; that might make the difference (a sketch of setting it explicitly follows this list). Note, though, that the test passes even without it being set if it's run on a fresh cluster, so that's clearly not the only problem. Plus, we're client.restart()-ing the workers before every test, so the workers should be in the same brand-new state regardless of whether the test is run on its own or after others. However, client.restart() doesn't do that much to the scheduler, so maybe that's where the problem is.

  • We know that every subsequent time you submit a workload to the scheduler, it runs slower and slower, and scheduler memory grows and grows; see "Are reference cycles a performance problem?" dask/distributed#4987 (comment). (There's no reason to think things have changed since that research last year.)

    As the scheduler gets sluggish, it is slower both to tell workers about data-consuming downstream tasks to run (instead of the data-producing root tasks they've already been told to run) and to allow them to delete completed keys that aren't needed anymore. Note that just because a worker runs a downstream task (like writing a chunk to zarr) doesn't mean it gets to release the upstream data immediately; it must be explicitly told by the scheduler to do so. If the scheduler is slow, the worker will go load even more data into memory while keeping around the chunks that have already been written to zarr and should have been released.

    Thus we see the double whammy of root task overproduction: as soon as the delicate balance of scheduler latency is thrown off, workers simultaneously produce memory faster than they should and release memory slower than they should.

    Basically, I think this will only be fixed by "Withhold root tasks [no co assignment]" (dask/distributed#6614), or by understanding and fixing whatever is causing the scheduler to slow down ("Are reference cycles a performance problem?", dask/distributed#4987), which is further out.
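In case we want to try the workaround for the first theory above, here is a minimal sketch of setting the variable explicitly via dask config. The config key is my best guess for distributed around 2022.6 (newer releases set a default themselves), and whether it is applied early enough for glibc to pick it up is also an assumption:

```python
import dask

# Assumed config key: the nanny passes these variables into each worker's
# environment so glibc returns freed memory to the OS more aggressively.
# Newer distributed releases set MALLOC_TRIM_THRESHOLD_ by default.
dask.config.set({"distributed.nanny.environ": {"MALLOC_TRIM_THRESHOLD_": 65536}})
```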

@ntabris
Member

ntabris commented Aug 17, 2022

With a lot of pain and watching GitHub CI logs and clicking at random at clusters on the coiled dashboard, I managed to find the cluster that was running the test and watch it

@gjoseph92 any chance you have the cluster id or a link to the details page?

@gjoseph92
Contributor Author

@ntabris sadly no, I closed that tab too long ago. And AFAICT there are no logging statements in the tests that allow you to see which cluster a test was running on, and the cluster names in tests are random.

@ntabris
Member

ntabris commented Aug 17, 2022

AFAICT there are no logging statements in the tests that allow you to see which cluster a test was running on, and the cluster names in tests are random

Hm, that might be good to change.
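Something like the sketch below could do it: have the client fixture print the cluster's name and dashboard link so they land in the CI logs. The fixture and attribute names here are assumptions, not the repo's actual fixtures.

```python
import pytest
from distributed import Client


@pytest.fixture
def small_client(small_cluster):
    # Hypothetical fixture: emit enough identifying information that a failing
    # CI run can later be matched to a cluster in the Coiled UI.
    print(f"Cluster name: {small_cluster.name}")
    with Client(small_cluster) as client:
        print(f"Dashboard: {client.dashboard_link}")
        yield client
```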

Regardless, I found it based on the public IP for the scheduler (shown in your screenshot!).

https://cloud.coiled.io/dask-engineering/clusters/51386/details

I might poke around a little at the logs tomorrow (and anyone else is of course welcome to as well).

@ntabris
Member

ntabris commented Aug 17, 2022

Oh, the first worker I checked shows over 2000 solid lines of:

distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.09 GiB -- Worker memory limit: 7.11 GiB

It's around 10/s for about 5 minutes.

@ian-r-rose
Contributor

Thanks for writing this up @gjoseph92. I'm trying this out right now, and was able to reproduce some of this locally (well, not locally, but you know what I mean). If I run test_climatic_mean by itself, it does okay, but if I run the whole module, things start to go badly in the same way you describe.

I'm still looking into this, but I wanted to note that I looked at the scheduler system tab while things were going poorly, and didn't see anything overly concerning:
(Screenshot of the scheduler system tab)

@ian-r-rose
Contributor

I think this demonstrates a real failure case. This workload should work in the allotted time, but it doesn't in a real-world case (when other things have run before it).

What do we want to do here? Merge it skipped? How do we want to handle integration tests for things that are currently broken? It's important not to forget about them, and also valuable to see when exactly they get fixed. If we skip it, I can try to remember to manually test it and un-skip it when something gets merged upstream that resolves it, but that's easy to mess up, and seeing the clear history gives us both traceability and a good story.

I like the idea of having a good history of memory usage, duration, etc., even for failing tests. I would note, however, that the measurements for tests that don't finish aren't really indicative of anything real. The duration would just show the timeout until things were fixed. The memory metrics would be a bit more meaningful, but I'd still hesitate to interpret them. So I might still advocate for skipping the problematic tests until we can figure out what's going on.

gjoseph92 and others added 2 commits August 17, 2022 12:33
Co-authored-by: Ian Rose <ian.r.rose@gmail.com>
@gjoseph92
Contributor Author

It is interesting that the scheduler looks okay. For now, I've just skipped it. We can talk about what to do in general on #252, and for the specific test on #253.
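(The skip itself is just pytest's standard marker; the reason text below is illustrative:)

```python
import pytest


@pytest.mark.skip(reason="Workers run out of memory when run after other tests; see #253")
def test_climatic_mean(small_client):
    ...
```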

@ian-r-rose
Contributor

This is all green -- anything else you'd like to do here before we merge @gjoseph92?

@gjoseph92
Contributor Author

Nope, I think this is good! Thanks for the review.

Development

Successfully merging this pull request may close these issues.

Initial set of automated performance benchmarks (non-H2O)